
Retrying applying CRs in the e2e jobs. #1293

Merged

Conversation

ybettan
Collaborator

@ybettan ybettan commented Jan 12, 2025

The webhook services are not always available right after their deployments become ready. This causes a race condition between the services becoming ready to receive requests and the CRs being applied to the cluster.

By retrying the CR application, we give the services the time they need to become ready.
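
For context, the fix reduces to a single retry loop per script; the e2e-hub variant, as it appears in the diffs reviewed below:

# Retry `oc apply` until it succeeds, giving up after 1 minute; the
# 3-second sleep gives the webhook services time to come up between attempts.
timeout 1m bash -c 'until oc apply -k ci/e2e-hub; do sleep 3; done'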


/assign @yevgeny-shnaidman @TomerNewman
Fixes #1291

Summary by CodeRabbit

  • Chores
    • Enhanced CI/CD scripts with improved error handling and timeout mechanisms for Kubernetes resource deployment.
    • Added retry logic to oc apply commands to handle potential transient failures.
    • Implemented timeout mechanisms for pod creation and resource application processes.


openshift-ci bot commented Jan 12, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ybettan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


coderabbitai bot commented Jan 12, 2025

Walkthrough

The pull request involves modifications to two Bash scripts used in Kubernetes deployment and build processes. The primary changes introduce a retry mechanism for applying resources and creating pods. Both scripts (ci/prow/e2e-hub-spoke-incluster-build and ci/prow/e2e-incluster-build) now include timeout and retry logic using bash commands that will attempt to apply resources or wait for pod creation multiple times with short sleep intervals, improving error handling and resilience during deployment.

Changes

  • ci/prow/e2e-hub-spoke-incluster-build: added a retry mechanism for oc apply -k ci/e2e-hub with a 1-minute timeout and 3-second sleep intervals
  • ci/prow/e2e-incluster-build: added retry mechanisms for resource application and pod creation with 1-minute timeouts and sleep intervals

Assessment against linked issues

  • Cherry-pick commit a0f0d37bcb9b0a1a38bba1096b89aca92d42ce4e
  • Resolve race condition with webhook services

The changes directly address the issue by implementing a retry mechanism that allows services more time to become ready before applying Custom Resources (CRs).




netlify bot commented Jan 12, 2025

Deploy Preview for openshift-kmm ready!

  • 🔨 Latest commit: 8156f88
  • 🔍 Latest deploy log: https://app.netlify.com/sites/openshift-kmm/deploys/6783b16f679dd90008663324
  • 😎 Deploy Preview: https://deploy-preview-1293--openshift-kmm.netlify.app


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (5)
ci/prow/e2e-incluster-build (2)

45-45: Add error handling for timeout

The script should handle cases where the timeout is reached. Currently, if the resources can't be applied within 1 minute, the script will fail without a clear error message.

-timeout 1m bash -c 'until oc apply -k ci/e2e; do sleep 3; done'
+if ! timeout 1m bash -c 'until oc apply -k ci/e2e; do sleep 3; done'; then
+  echo "ERROR: Failed to apply resources after 1 minute of retries"
+  exit 1
+fi

Line range hint 48-51: Simplify and improve pod name matching

The current jq queries for matching pod names could be simplified and made more robust by:

  1. Using explicit label selectors with oc get pods
  2. Simplifying the jq query
-timeout 1m bash -c 'until oc get pods -o json | jq -er ".items[].metadata.name | select(.? | match(\"build\"))"; do sleep 1; done'
-POD_NAME=$(oc get pods -o json | jq -r '.items[].metadata.name | select(.? | match("build"))')
+timeout 1m bash -c 'until oc get pods -l job-name=kmm-ci-build -o name; do sleep 1; done'
+POD_NAME=$(oc get pods -l job-name=kmm-ci-build -o name | sed 's|pod/||')

-timeout 1m bash -c 'until oc get pods -o json | jq -er ".items[].metadata.name | select(.? | match(\"sign\"))"; do sleep 1; done'
-POD_NAME=$(oc get pods -o json | jq -r '.items[].metadata.name | select(.? | match("sign"))')
+timeout 1m bash -c 'until oc get pods -l job-name=kmm-ci-sign -o name; do sleep 1; done'
+POD_NAME=$(oc get pods -l job-name=kmm-ci-sign -o name | sed 's|pod/||')

Also applies to: 54-57

ci/prow/e2e-hub-spoke-incluster-build (3)

50-50: Document timeout values

The script uses various timeout values (1m, 3m, 15m) for different operations. Consider adding comments explaining the rationale behind these timeout values to help maintainers understand and adjust them if needed.

+# Retry applying resources for up to 1 minute to handle webhook service availability
 timeout 1m bash -c 'until oc apply -k ci/e2e-hub; do sleep 3; done'

Line range hint 52-54: Standardize build wait patterns

The build wait patterns could be standardized with the pod wait patterns from the other script for consistency.

-timeout 1m bash -c 'until oc -n ${HUB_OPERATOR_NAMESPACE} get builds -o json | jq -er ".items[].metadata.name | select(.? | match(\"build\"))"; do sleep 1; done'
-export build_build=$(oc -n ${HUB_OPERATOR_NAMESPACE} get builds -o json | jq -r '.items[].metadata.name | select(.? | match("build"))')
+timeout 1m bash -c 'until oc -n ${HUB_OPERATOR_NAMESPACE} get builds -l job=kmm-ci-build -o name; do sleep 1; done'
+export build_build=$(oc -n ${HUB_OPERATOR_NAMESPACE} get builds -l job=kmm-ci-build -o name | sed 's|build/||')

-timeout 1m bash -c 'until oc -n ${HUB_OPERATOR_NAMESPACE} get builds -o json | jq -er ".items[].metadata.name | select(.? | match(\"sign\"))"; do sleep 1; done'
-export sign_build=$(oc -n ${HUB_OPERATOR_NAMESPACE} get builds -o json | jq -r '.items[].metadata.name | select(.? | match("sign"))')
+timeout 1m bash -c 'until oc -n ${HUB_OPERATOR_NAMESPACE} get builds -l job=kmm-ci-sign -o name; do sleep 1; done'
+export sign_build=$(oc -n ${HUB_OPERATOR_NAMESPACE} get builds -l job=kmm-ci-sign -o name | sed 's|build/||')

Also applies to: 57-59


50-50: Consider extracting common retry patterns

Both scripts implement similar retry patterns for oc apply and resource waiting. Consider extracting these patterns into helper functions in a shared script to reduce duplication and ensure consistency.

Example shared functions:

# helpers.sh
wait_for_resource() {
    local namespace=$1
    local resource_type=$2
    local selector=$3
    local timeout=${4:-1m}
    
    if ! timeout "${timeout}" bash -c "until oc -n ${namespace} get ${resource_type} ${selector} &>/dev/null; do sleep 1; done"; then
        echo "ERROR: Timeout waiting for ${resource_type} in namespace ${namespace}"
        return 1
    fi
}

retry_apply() {
    local path=$1
    local timeout=${2:-1m}
    
    if ! timeout "${timeout}" bash -c "until oc apply -k ${path}; do sleep 3; done"; then
        echo "ERROR: Failed to apply resources at ${path} after ${timeout} of retries"
        return 1
    fi
}
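
For illustration, a hypothetical call site using these helpers (the sourcing path and label selector are assumptions carried over from the earlier suggestions, not existing script contents):

# Hypothetical usage in ci/prow/e2e-hub-spoke-incluster-build, assuming
# helpers.sh is shipped next to the CI scripts.
source "$(dirname "$0")/helpers.sh"

# Retry applying the kustomize directory for up to 1 minute.
retry_apply ci/e2e-hub 1m

# Wait for the build object to appear before following its logs.
wait_for_resource "${HUB_OPERATOR_NAMESPACE}" builds "-l job=kmm-ci-build" 1m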
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 36447e9 and 0835c06.

📒 Files selected for processing (2)
  • ci/prow/e2e-hub-spoke-incluster-build (1 hunks)
  • ci/prow/e2e-incluster-build (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Analyze (go)

The webhook services are not always available right after their
deployments become ready. This causes a race condition between the
services becoming ready to receive requests and the CRs being applied
to the cluster.

By retrying the CR application, we give the services the time they need
to become ready.

Signed-off-by: Yoni Bettan <[email protected]>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
ci/prow/e2e-hub-spoke-incluster-build (1)

Line range hint 1-54: Consider architectural improvements for better reliability.

The script has several areas where reliability could be improved:

  1. Inconsistent timeout durations (1m vs 3m vs 15m)
  2. Limited error handling and reporting
  3. No cleanup on failure

Consider these improvements:

  1. Define timeout constants at the top of the script:
readonly RESOURCE_TIMEOUT="5m"
readonly BUILD_TIMEOUT="10m"
readonly ACM_TIMEOUT="15m"
  2. Add a trap for cleanup:
cleanup() {
  # Add cleanup logic here
  echo "Cleaning up resources..."
}
trap cleanup EXIT
  3. Add an error-reporting function:
report_error() {
  local component=$1
  local message=$2
  echo "Error in ${component}: ${message}"
  # Add logging or metric collection
}
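
Put together, a minimal sketch of how these three suggestions might sit at the top of the script (the cleanup body and component name are placeholders, not the script's actual logic):

#!/usr/bin/env bash
set -euo pipefail

# Tunable timeout constants, defined in one place.
readonly RESOURCE_TIMEOUT="5m"
readonly BUILD_TIMEOUT="10m"
readonly ACM_TIMEOUT="15m"

cleanup() {
  # Placeholder: delete test namespaces, stop port-forwards, etc.
  echo "Cleaning up resources..."
}
trap cleanup EXIT

report_error() {
  local component=$1
  local message=$2
  echo "Error in ${component}: ${message}" >&2
}

if ! timeout "${RESOURCE_TIMEOUT}" bash -c 'until oc apply -k ci/e2e-hub; do sleep 3; done'; then
  report_error "e2e-hub" "resources not applied within ${RESOURCE_TIMEOUT}"
  exit 1
fi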
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0835c06 and 8156f88.

📒 Files selected for processing (2)
  • ci/prow/e2e-hub-spoke-incluster-build (1 hunks)
  • ci/prow/e2e-incluster-build (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • ci/prow/e2e-incluster-build
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Analyze (go)
🔇 Additional comments (1)
ci/prow/e2e-hub-spoke-incluster-build (1)

54-54: Verify the timeout and retry intervals.

The retry mechanism is a good addition to handle race conditions with webhook services. However, consider the following aspects:

  1. The 1-minute timeout might be too short for larger deployments.
  2. The 3-second sleep interval between retries might lead to unnecessary load on the API server.

Consider this improved version with exponential backoff:

-timeout 1m bash -c 'until oc apply -k ci/e2e-hub; do sleep 3; done'
+timeout 5m bash -c '
+  retries=0
+  until oc apply -k ci/e2e-hub; do
+    retries=$((retries + 1))
+    if [ $retries -gt 10 ]; then
+      echo "Failed to apply resources after 10 retries"
+      exit 1
+    fi
+    sleep $((2 ** retries))
+  done
+'

Let's verify the resource application patterns in the codebase:

✅ Verification successful

The timeout and retry intervals are consistent with established patterns.

The current implementation using a 1-minute timeout and 3-second sleep interval aligns with similar resource application patterns throughout the codebase. While exponential backoff could be an improvement, the current approach is consistent with the project's conventions and sufficient for the use case.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for similar resource application patterns that might need the same retry mechanism
rg -A 2 "oc apply" --type bash

Length of output: 61


Script:

#!/bin/bash
# Search for oc apply patterns with context
rg -C 3 "oc apply"

# Search for timeout patterns
rg -C 2 "timeout.*bash"

# Search for retry/until patterns
rg -C 2 "until.*do"

Length of output: 42582

@ybettan
Collaborator Author

ybettan commented Jan 12, 2025

/retest

@ybettan
Collaborator Author

ybettan commented Jan 12, 2025

/retest

@ybettan
Collaborator Author

ybettan commented Jan 12, 2025

/retest

@TomerNewman
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jan 13, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit d80a1e4 into rh-ecosystem-edge:main Jan 13, 2025
20 checks passed
@ybettan ybettan deleted the retrying-cr-apply branch January 13, 2025 08:25
Development

Successfully merging this pull request may close these issues.

Cherry-picking error for a0f0d37bcb9b0a1a38bba1096b89aca92d42ce4e